Abstract

In many data mining applications, the collection of sufficiently large
datasets is the most time-consuming and expensive step. On the other
hand, industrial methods of data collection create huge databases, which
make direct application of advanced machine learning algorithms
difficult. To address these problems, we consider active learning (AL),
which may be very efficient either for experimental design or for data
filtering. Using the on-line evaluation opportunity provided by the AL
Challenge, we demonstrate that quite competitive results may be produced
with a small percentage of the available data. We also present several
alternative criteria, which may be useful for the evaluation of active
learning processes. The author of this paper attended the special
presentation in Barcelona, where the results of the WCCI 2010 AL
Challenge were discussed.
1 Introduction

Traditional supervised learning algorithms use whatever labeled data is
provided to construct a model. By contrast, active learning gives the
learner a degree of control by allowing it to select the labeled
instances to be added to the training set [1]. Two of the most popular
AL methods are uncertainty sampling, where the learner queries the
instance about which it has the least certainty [2], and
query-by-committee, where a "committee" of models selects the instance
about which its members most disagree [3].
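
The two query rules just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names are hypothetical, uncertainty is measured as the distance of a predicted probability from 0.5, and committee disagreement is measured by the size of the minority vote.

```python
def uncertainty_query(probabilities):
    """Return the index of the instance whose predicted P(positive)
    is closest to 0.5, i.e. the one the learner is least certain about."""
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))


def committee_query(votes):
    """votes[m][i] is the 0/1 label that committee member m assigns to
    instance i. Return the index of the instance with the largest
    minority vote, i.e. the one on which the members disagree most."""
    n_members = len(votes)
    n_instances = len(votes[0])

    def disagreement(i):
        positives = sum(member[i] for member in votes)
        return min(positives, n_members - positives)

    return max(range(n_instances), key=disagreement)
```

For example, `uncertainty_query([0.9, 0.48, 0.1, 0.7])` selects the second instance, whose probability is nearest to 0.5.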
Based on our experience, we can divide the AL process into three main
periods: 1) initial, which may be characterised by a high level of
volatility because of the lack of information; 2) actual, during which
significant progress may be made; and 3) validation, which may be
implemented, for example, using random sampling.

We claim that intermediate results during the initial period are not
important, because they are based on insufficient evidence. However,
the outcome of the initial period is significant, as it represents the
starting point for the following, most important period of actual AL.

During the second period of actual AL, it is essential for the learning
system to demonstrate speed and consistency of improvement, and we can
use a variety of methods to check the convergence of the learning
process. Once we have found that the results are satisfactory, we can
collect a large number of labels at random and evaluate the learning
trajectory (see Figures 3 and 4). Rows "AUC", "ALC" and "AUC-rand",
"ALC-rand" of Table 1 demonstrate the similarity between the final
(official) results and the corresponding results evaluated using random
sampling.

AL may be implemented through very interesting and sophisticated
methods [4, 5]. Unfortunately, it is most unlikely that these methods
would be very efficient in this particular Challenge because of the
problems with the evaluation criterion (see Section 3 for more details).

Some interesting papers, including the report of the Organisers of the
WCCI 2010 AL Challenge [6], may be downloaded for free from the web-site
of the Journal of Machine Learning Research^1.

2 Task Description and Methods 

In many applications, including handwriting recognition,
chemo-informatics, and text processing, large amounts of unlabeled data
are available at low cost, but labeling examples (using a human expert
to find the corresponding labels) is tedious and expensive. Hence there
is a benefit either in using unlabeled data to improve the model with a
semi-supervised learning algorithm [7, 8], or in sampling efficiently
and using human expertise to label only the most informative examples.

Six pool-based active learning tasks, named A, B, C, D, E, and F, were
considered during the AL Challenge. In these tasks large unlabeled
datasets are available from the outset of the Challenge, and the
participants can place queries to acquire data for some amount of
virtual cash. The participants need to return prediction values for all
the labels every time they want to purchase new labels. This allows the
Organisers to draw learning curves of prediction performance versus the
amount of virtual cash spent. The participants are judged according to
the area under the learning curves (3).

1 http://jmlr.csail.mit.edu/proceedings/papers/v16/

Figure 1: Sorted decision functions corresponding to the set D: (a) step 2; (b) step
5; (c) step 12 and (d) step 16.

2.1 Uncertainty Sampling 

At the beginning of the AL Challenge^2, we were given only one labeled
(positive) sample. We assumed that the data are highly imbalanced, with
a smaller number of positive instances. Accordingly, for the first step
we considered 50-100 random sets, each consisting of the given positive
sample and other samples (assumed to be negative), which were selected
randomly under the condition that they are sufficiently distant from the
given positive instance. The decision function was calculated as a
sample average over these sets. We continued to use similar random
sampling in the further steps, but with a smaller proportion of randomly
selected instances. After 4-6 steps we stopped using random sampling.
The query for a new label was then carried out according to the
structure of the decision function, sorted in increasing order; see
Figure 1.
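
The first step described above can be sketched as follows. This is a hedged illustration, not the code used in the Challenge: the actual classifier was kridge from CLOP, whereas the sketch swaps in a simple nearest-centroid score, and the names `initial_scores`, `n_neg` and `far_quantile` are hypothetical.

```python
import math
import random


def euclid(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def initial_scores(pool, positive, n_sets=50, n_neg=20,
                   far_quantile=0.5, seed=0):
    """Average decision function over n_sets random training sets.

    Each set holds the single known positive plus pseudo-negatives drawn
    at random from the pool instances that are far from the positive
    (distance at or above the far_quantile of all distances).
    The per-set score is a stand-in for a real classifier: distance to
    the pseudo-negative centroid minus distance to the positive.
    """
    rng = random.Random(seed)
    dists = [euclid(x, positive) for x in pool]
    cutoff = sorted(dists)[int(far_quantile * (len(pool) - 1))]
    far = [i for i, d in enumerate(dists) if d >= cutoff]
    dim = len(positive)
    totals = [0.0] * len(pool)
    for _ in range(n_sets):
        chosen = rng.sample(far, min(n_neg, len(far)))
        centroid = [sum(pool[i][k] for i in chosen) / len(chosen)
                    for k in range(dim)]
        for j, x in enumerate(pool):
            totals[j] += euclid(x, centroid) - euclid(x, positive)
    return [total / n_sets for total in totals]
```

Higher averaged scores indicate instances that look more like the known positive; the returned list plays the role of the decision function that is then sorted before querying.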

2 http://www.causality.inf.ethz.ch/activelearning.php

Table 1: Some statistical characteristics related to the Sets A-F and to the
methods, which we used during the AL Challenge.

DataSet                            A        B        C        D        E        F
Sample size                    17535    25000    25720    10000    32252    67628
No of submissions                 13       17       16       17       13       13
Used samples                     811     1101     2301      531      661      900
Percentage                     4.63%    4.40%    8.95%    5.31%    2.05%    1.33%
Last weight                     4.32     4.51     3.48     4.24     5.61     6.23
Validation                       900     1200     2000     1000      800     1999
Percent (positives)           30.95%   17.98%   16.71%   39.36%   18.76%   27.86%
AUC_1                         0.5177   0.6304   0.4988   0.5098   0.5613   0.6179
AUC                           0.8847   0.7323   0.7766   0.9404   0.7457   0.9853
ALC                           0.4775   0.2834   0.2378    0.602   0.3689   0.6517
ALC_2                         0.5178   0.3941   0.3415   0.5874   0.3761   0.7456
Percent (positives) - rand    20.89%       8%    8.85%    25.7%   11.75%    7.25%
AUC - rand                    0.8929   0.7326   0.7854   0.9334   0.7501   0.9843
ALC - rand                    0.4682   0.2915   0.2367   0.6038   0.3417   0.6398

We decided that the most appropriate range of samples for a query is
the one where the decline of the sorted decision function changes from
rapid to smooth. As a consequence, the fractions of positive samples in
our training sets were significantly higher than the fractions of
positive samples in the validation sets, which were collected randomly;
see rows "Percent (positives)" and "Percent (positives) - rand" of
Table 1.
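
One possible reading of this query rule is sketched below. The text does not specify how the transition from rapid to smooth change was located, so the sketch assumes a simple heuristic: the point of the sorted decision function that lies farthest from the straight line joining its two ends. The names `knee_index` and `query_window` are hypothetical.

```python
def knee_index(sorted_scores):
    """Position in a sorted score list that deviates most from the chord
    joining the first and last values (a simple 'knee' heuristic).
    Requires at least two scores."""
    n = len(sorted_scores)
    first, last = sorted_scores[0], sorted_scores[-1]

    def gap(i):
        chord = first + (last - first) * i / (n - 1)
        return abs(sorted_scores[i] - chord)

    return max(range(n), key=gap)


def query_window(scores, width):
    """Indices (into the original pool) of up to `width` instances whose
    sorted scores surround the knee; these would be queried next."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    knee = knee_index([scores[i] for i in order])
    lo = max(0, knee - width // 2)
    hi = min(len(order), lo + width)
    return [order[i] for i in range(lo, hi)]
```

On a curve that rises steeply over its first ten points and then flattens, the window is centred near the tenth point, which is where the slope changes.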

During the initial few steps, we used the kridge function from the CLOP
package as a classifier. After collecting more labels, we started
experiments with other functions: neural (also from the CLOP package),
and GLM, ADA, and GBM from R. We also conducted feature selection using
the Wilcoxon criterion for the Sets A, B, C and E, and a special
likelihood-based (binary) criterion for the Set D. Below the level of
200-300 labeled samples, we used leave-one-out (LOO) evaluation to
optimise the parameters. Beyond that level, we conducted experiments
using cross-validation (CV) with 10-20 folds. The final decision
function was computed as an ensemble of the base decision functions,
with weights defined according to the results of the CV evaluations.

In the final phase of our experiment, we collected a sufficiently large
random sample for the evaluation of our learning curve (the sizes of the
validation samples are given in the row "Validation" of Table 1).

2.2 An ensemble constructor (a general approach) 

Definition 1 An ensemble is defined as heterogeneous if the base models in
an ensemble are generated by methodologically different learning algorithms
(we shall consider such an ensemble in this Section). On the other hand, an
ensemble is defined as homogeneous if the base models are of the same type
(for example, boosting or random forest).

Suppose we have two high-quality solutions that are very different, for
example because their decision values lie on different scales. A direct
sample average will not be efficient in this case.

The following very simple Matlab code may be useful for adjusting one
solution to the scale of the other without any loss of quality: the
ordering of the adjusted solution, and hence its AUC, is unchanged.

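[a1,f1]=sort(x1); % solution N1: a1 holds its values in increasing order
[a2,f2]=sort(x2); % solution N2: f2 is the ordering of its values
[b2,g2]=sort(f2); % g2(i) is the rank of x2(i) within solution N2
x3=x1;
for i=1:size(x1,1),
ii=g2(i);
x3(i)=a1(ii); % adjusted solution: ranks of x2, values of x1
end;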

After the above adjustment, we can compute an ensemble solution as a
linear combination

$$x_{ens} = \tau \cdot x_1 + (1 - \tau) \cdot x_3 \tag{1}$$

of the input solutions $x_1$ and $x_3$, where $0 < \tau < 1$ is a
positive weight coefficient. Clearly, the stronger the performance of
the solution $x_1$ compared to $x_3$, the bigger the value of the
coefficient $\tau$ will be.
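
For readers without Matlab, the same rank-based adjustment and the linear combination (1) can be sketched in Python. The function names are hypothetical, and ties in the second solution are broken by position, which is an assumption of this sketch.

```python
def adjust_to_scale(x1, x2):
    """Return x3: the values of x1 re-assigned so that x3 has the same
    ordering as x2. The i-th smallest entry of x2 receives the i-th
    smallest value of x1."""
    a1 = sorted(x1)
    order2 = sorted(range(len(x2)), key=lambda i: x2[i])
    x3 = [0.0] * len(x2)
    for rank, i in enumerate(order2):
        x3[i] = a1[rank]
    return x3


def ensemble(x1, x3, tau):
    """Linear combination tau * x1 + (1 - tau) * x3, with 0 < tau < 1."""
    return [tau * u + (1.0 - tau) * v for u, v in zip(x1, x3)]
```

Because `x3` preserves the ranking of `x2`, any rank-based score of the second solution (such as the AUC) is unchanged by the adjustment.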



3 ALC Criterion 

Suppose we have a sequence $n_i$, $i = 1, \ldots, N$, of request sizes,
where $n_1 = 1$, and let $t_i = \sum_{j=1}^{i} n_j$ be the corresponding
training sizes. By definition, the area under the learning curve (ALC) is

$$ALC = \frac{2}{\log_2 T} \sum_{i=1}^{N} \overline{AUC}_i \log_2 \frac{t_{i+1}}{t_i} - 1, \tag{2}$$

where

$$\overline{AUC}_i = \frac{AUC_i + AUC_{i+1}}{2},$$

$AUC_{N+1} = AUC_N$, and $t_{N+1} = T$ is the total available sample
size (see row "Sample size" in Table 1).
Figure 2: Behavior of the weight function w, which is defined in (3).
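
The ALC defined in (2) can be evaluated with a short routine. The sketch below assumes the reading of (2) used here: trapezoids of the AUC over the base-2 logarithm of the training size, the last AUC held constant up to the full sample size T, and a normalisation that maps a constant AUC of 0.5 to 0 and a constant AUC of 1 to 1. The function name `alc` is hypothetical.

```python
import math


def alc(auc, t, total):
    """Normalised area under the learning curve.

    auc[i] is the AUC obtained with t[i] labels (t[0] = 1 in the
    Challenge); total is the full sample size T.
    """
    if len(auc) != len(t):
        raise ValueError("auc and t must have the same length")
    n = len(auc)
    area = 0.0
    for i in range(n):
        # Beyond the last entry the AUC is held constant up to T.
        auc_next = auc[i + 1] if i + 1 < n else auc[i]
        t_next = t[i + 1] if i + 1 < n else total
        area += 0.5 * (auc[i] + auc_next) * math.log2(t_next / t[i])
    return 2.0 * area / math.log2(total) - 1.0
```

With only the two Set A entries from Table 1 (AUC 0.5177 with one label, AUC 0.8847 with 811 labels, T = 17535) the routine gives approximately 0.518, in agreement with the binary-strategy value reported for Set A.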

In order to simplify further notation, suppose that $n_i = n_2 + 1$,
$i = 3, \ldots, N$, $N > 3$. Then the criterion ALC given by equation (2)
may be rewritten in the following form:

$$ALC \sim \overline{AUC}_1 \log_2 (n_2 + 1) + \sum_{i=2}^{N-1} \overline{AUC}_i w_i + AUC_N \log_2 \frac{T}{t_N}, \tag{3}$$

where

$$w_i = \log_2 \left(1 + \frac{1}{i - 1}\right),$$

and we ignore in (3) the linear and shift coefficients, as they do not
make any difference to the ranking of the learning curves.

Figure 3: Trajectories in terms of AUCs corresponding to the Sets A-F.

Figure 4: Trajectories in terms of ALCs corresponding to the Sets A-F.

Figure 5: ALCs as a function of the first entry (see Section 3.2 for more details).
The exact values of the lowest points (blue dotted lines) are given in the row
"ALC_2" of Table 1.

Figure 2 illustrates the rapid decline in the coefficients $w_i$,
$i = 2, \ldots, N$. But the real winner in this particular Challenge was
the first coefficient $w_1 = \log_2 (n_2 + 1)$. In all our submissions
we used $n_2 = 50$, which corresponds to $w_1 \approx 5.67$. Values
corresponding to the last coefficient

$$w_N = \log_2 \frac{T}{t_N}$$

are presented in the row "Last weight" of Table 1. Only in the case of
Set F, where we used 1.33% of all available data, is the last term
slightly more important than the first term. In all other cases the
first term is the most important. Given that we did not use more than 9%
of all available data, this fact appears to be very surprising.
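
The comparison of the first and last weights can be reproduced numerically. The sketch below uses hypothetical function names and, for brevity, only the Set C and Set F entries of Table 1 (sample size and number of used samples); the weights follow the definitions of $w_1$, $w_i$ and $w_N$ given in this section.

```python
import math


def first_weight(n2):
    """w_1 = log2(n_2 + 1), the weight attached to the first entry."""
    return math.log2(n2 + 1)


def step_weight(i):
    """w_i = log2(1 + 1 / (i - 1)) for the intermediate entries, i >= 2."""
    return math.log2(1 + 1 / (i - 1))


def last_weight(total, t_last):
    """w_N = log2(T / t_N), the weight attached to the final entry."""
    return math.log2(total / t_last)


# (sample size T, used samples t_N) copied from Table 1 for two sets.
table1 = {"C": (25720, 2301), "F": (67628, 900)}

w1 = first_weight(50)  # n_2 = 50 was used in all submissions
for name, (total, used) in table1.items():
    wn = last_weight(total, used)
    print(name, round(w1, 2), round(wn, 2), "last > first" if wn > w1 else "first > last")
```

The intermediate weights shrink quickly (for instance `step_weight(2)` is 1 and `step_weight(10)` is about 0.15), which is why the first and last terms dominate the criterion.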

During the AL Competition, we acted in accordance with natural logic
(learn carefully with small steps). We had no time to practise with the
development sets before the Competition in order to discover some
essential features of the criterion (2).

3.1 Illustration 

Figures 3(F) and 4(F) illustrate the fact that the first two values
(vertical points) of the trajectories corresponding to the Set F are
equal. This happened accidentally because we submitted a solution from
the wrong directory. In this particular Challenge such a mistake was
very serious. In order to check the sensitivity of the ALC evaluation
criterion, we replaced the second value by the average of the first and
third values. As a result, the ALC grew from 0.6517 to 0.6842.

3.2 Special binary case: N=2 

After the results of the competition were released, we noticed that
binary strategies with only two entries (or with the first step as a
very big jump) were very popular. The motivation for such an approach is
very simple. As discussed in Section 3, it is extremely important to
maximise $AUC_1$ because of the given evaluation criterion (2).

One way to do this is through cooperation between different teams
(formally it is restricted, but cannot be controlled with full
certainty). The second way is to collect a very large sample at random,
for example, 5-10% of all available data (to ensure a sufficiently
large value of $AUC_2$). Probably, that was the best choice during this
Challenge. In fact, we have found that the criterion (2) does not
encourage, but rather discourages, the process of actual AL.

Let us consider a more detailed illustration of the above fact. In the
case N = 2 we can rewrite (2) in the following simplified form:

$$ALC_2 = \frac{2}{\log_2 T} \left( \frac{AUC_1 + AUC_2}{2} \log_2 t_2 + AUC_2 \log_2 \frac{T}{t_2} \right) - 1. \tag{4}$$

Figure 5 illustrates the behaviour of $ALC_2$ as a function of $AUC_1$,
where the left (smallest) horizontal point corresponds to our initial
AUC score; the values of $t_2$ are given in the row "Used samples" and
the values of $AUC_2$ in the row "AUC" of Table 1.
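
As a worked check of the binary form of the criterion, the Set B entries of Table 1 can be substituted into it. The derivative in the last line is an added observation under the same formula: for fixed $t_2$ and $AUC_2$, the binary ALC is linear in the first entry, which is consistent with the straight lines one would expect in Figure 5.

```latex
% Set B (Table 1): T = 25000, t_2 = 1101, AUC_1 = 0.6304, AUC_2 = 0.7323
\begin{align*}
\log_2 T &\approx 14.610, \qquad
\log_2 t_2 \approx 10.105, \qquad
\log_2 \frac{T}{t_2} \approx 4.505, \\
ALC_2 &= \frac{2}{\log_2 T}
  \left( \frac{AUC_1 + AUC_2}{2} \log_2 t_2
       + AUC_2 \log_2 \frac{T}{t_2} \right) - 1 \\
&\approx \frac{2}{14.610}
  \left( 0.68135 \cdot 10.105 + 0.7323 \cdot 4.505 \right) - 1
 \approx \frac{2 \cdot 10.184}{14.610} - 1 \approx 0.394, \\
\frac{\partial ALC_2}{\partial AUC_1} &= \frac{\log_2 t_2}{\log_2 T} \approx 0.69 .
\end{align*}
```

The result agrees with the value 0.3941 listed for Set B in the binary-strategy row of Table 1.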

Remark 2 We were able to observe that such a dramatic simplification of
the strategy led to an improvement of the winning ALC score for Set B (we
do believe that we will be able to reproduce our final result of 0.7323
using 1101 randomly selected labeled instances). All results
corresponding to the binary strategy with our initial and final AUC
scores are given in the row "ALC_2" of Table 1, where only in the case
of Set D was the result slightly worse than our final result.

Remark 3 Normally, it is unlikely that an initial AUC score based on
only one labeled point will be better than 0.55. We do believe that if an
initial result were greater than 0.65, there must be some problem with
the data, or the competitor who generated such a result used some
additional information, which was restricted by the rules of the
Challenge. Our initial AUC scores are given in the row "AUC_1" of
Table 1.

3.3 Optimal size of the request in the binary case 

It will be more convenient to rewrite equation (4) in the following form:

$$ALC_2(t) \sim \frac{\log_2 t}{2} \left( AUC_1 - AUC_2(t) \right) + AUC_2(t) \log_2 T, \tag{5}$$

where $AUC_2(t)$ is an increasing function of $1 \le t \le T$, with
$AUC_1 = AUC_2(1)$.

Suppose that, based on the arguments and considerations of the previous
section, we decided to select a binary strategy, and our task is to
maximise (5).

Figure 3 illustrates that after some point the behaviour of the AUC
graphs becomes nearly stable, and there is no sense in going further,
because the above target function (5) will decline slowly at a
logarithmic rate.

The main question is how to define the optimal stopping point. In the
case where the competitor correctly follows the second period of actual
AL with small steps, this problem will be solved naturally by comparing
the current and several previous solutions. But the competitor will face
an extra penalty as a result of the small steps during the initial
period. So in this particular Challenge the random selection of 5-10% of
the available labels is fine and the most appropriate ("the first step
as a big jump"). As we can see, the competitor is strongly encouraged to
avoid the most important period of actual AL: just make a big jump, and
there is absolutely no sense in doing anything after that.
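
The trade-off behind the optimal request size in the binary case can be illustrated numerically. The sketch below maximises the binary target, (log2(t) / 2) * (AUC_1 - AUC_2(t)) + AUC_2(t) * log2(T), by exhaustive search. The learning curve AUC_2(t) is not known in advance, so a purely hypothetical saturating curve is assumed; all function names and parameter values are illustrative only.

```python
import math


def binary_target(t, total, auc1, auc2_of_t):
    """Unnormalised binary ALC as a function of the request size t."""
    auc2 = auc2_of_t(t)
    return 0.5 * math.log2(t) * (auc1 - auc2) + auc2 * math.log2(total)


def best_request_size(total, auc1, auc2_of_t):
    """Request size in 1..total that maximises the binary target."""
    return max(range(1, total + 1),
               key=lambda t: binary_target(t, total, auc1, auc2_of_t))


def saturating_curve(auc1, auc_max, scale):
    """Hypothetical learning curve: starts at auc1 for t = 1 and
    approaches auc_max exponentially with the given scale."""
    def curve(t):
        return auc_max - (auc_max - auc1) * math.exp(-(t - 1) / scale)
    return curve


curve = saturating_curve(auc1=0.55, auc_max=0.90, scale=200.0)
t_best = best_request_size(10000, 0.55, curve)
print(t_best, round(100.0 * t_best / 10000, 1), "% of the data")
```

Under this assumed curve the maximum lies strictly inside the interval: once the AUC has saturated, additional labels only reduce the target through the logarithmic term.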

It is most unlikely that the Organisers would have used the evaluation
criterion (2) if they had had this paper in hand beforehand.

4 Two Proposed Evaluation Criteria 

4.1 Formulation of the framework related to the modified 
criterion 

Our target is to produce better classification accuracy with a smaller
number of labeled examples.

The learning process includes two subintervals (according to the number
of labels used):

1) The first subinterval, in which the number of labeled instances is
less than $\delta$ (for example, $\delta$ may be 1% of all available
data and, in any case, not more than 200). This subinterval will not be
counted for the Challenge.

2) The second subinterval of actual AL, after $\delta$. This subinterval
will be counted for the Challenge.

Figure 6: The same trajectories as in Figure evaluated with criterion (6).

4.2 Motivation 

In an industrial sense, any decision making based on a small number of
labeled instances cannot be regarded as serious. Any results based on a
small number of labeled instances have insufficient grounds to be
implemented (and, probably, may only give some directions for further
studies).

The main idea: it is not important how the participant grows to the
level of $\delta$; the actual learning and savings start after that.



4.3 Second proposed criterion 

Let us consider

$$Q = \max_{i=1,\ldots,N} \frac{\delta \cdot AUC_i}{\delta + \alpha \cdot \max (0, t_i - \delta)}, \tag{6}$$

where $\delta > 0$ and $\alpha > 0$ are regulation parameters.
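
A minimal sketch of this criterion is given below, assuming that it takes the form Q = max over i of delta * AUC_i / (delta + alpha * max(0, t_i - delta)), where t_i is the number of labels used at step i. The helper `calibrate_alpha` is hypothetical: it picks alpha so that an AUC of a reached with delta labels scores the same as an AUC of b reached with a larger budget big_delta.

```python
def criterion_q(auc, t, delta, alpha):
    """Best penalised AUC over all steps. Steps that use at most delta
    labels are not penalised; beyond delta the penalty grows linearly
    with the number of extra labels."""
    return max(delta * a / (delta + alpha * max(0.0, ti - delta))
               for a, ti in zip(auc, t))


def calibrate_alpha(a, b, delta, big_delta):
    """alpha for which (delta labels, AUC a) and (big_delta labels, AUC b)
    receive the same score: a / delta = b / (delta + alpha * (big_delta - delta))."""
    return delta * (b - a) / (a * (big_delta - delta))


total = 10000
delta = 0.01 * total       # threshold: 1% of the data
big_delta = 0.2 * total    # reference budget: 20% of the data
alpha = calibrate_alpha(a=0.6, b=0.9, delta=delta, big_delta=big_delta)
print(round(alpha, 4))
print(criterion_q([0.55, 0.60, 0.80, 0.90], [1, 100, 500, 2000], delta, alpha))
```

With b = 1.5 a, as in the illustration above, the calibrated alpha does not depend on a and equals 0.5 * 0.01 / 0.19, about 0.026.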






Example

Similarly to Section 4.1, we can select the threshold $\delta$ in (6) as
1% of $T$: $\delta = 0.01 \cdot T$.

Suppose that $A$ and $B$ are the expected values of AUC corresponding to
$\delta$ and $\Delta$, where $\Delta$ is 20% of $T$:
$\Delta = 0.2 \cdot T$ (for example, $B = 1.5 A$). The following
equation follows from (6):

$$\frac{A}{\delta} = \frac{B}{\delta + \alpha (\Delta - \delta)}.$$

Therefore,

$$\alpha = \frac{\delta (B - A)}{A (\Delta - \delta)}.$$

Note that all graphs in Figure 6 were computed using the same regulation
parameters as in this section.

Remark 4 The criterion (6) imposes equal penalties on any solution made
within the level of $\delta$. After that level, the penalty will grow.
However, the quality of the solution will grow as well, and the task is
not to stop too early. Moreover, by using "big jumps" the competitors
face the risk of missing an optimal point. Therefore, criterion (6) will
encourage actual AL with many steps, which must be reasonably small.

5 Concluding Remarks

As was noted in [9], if the improvement of a quantitative criterion such
as the error rate is the main contribution of a paper, the superiority
of a new algorithm should always be demonstrated on independent
validation data. In this sense, the importance of data mining contests
is unquestionable. The rapid growth in popularity of data mining
challenges demonstrates with confidence that they are the best known way
to evaluate different models and systems.

While the idea of AL appears to be very promising, we think that it is
not quite a suitable subject for data mining competitions, because of
the difficulty of checking the independence of the learning processes.
As discussed in this paper (and during intensive email correspondence
shortly after the results were released), we believe that the evaluation
criterion used during the AL Challenge severely overestimated the
importance of the initial learning period. The first step (based on one
labeled sample) was the most important. In the case where this step was
unsuccessful, it was not possible to compensate for the loss in the
further steps. On the other hand, if we know that the first submission
was strong, we can request a large (random) amount of labels, and
success at the second step (and the final success) will be guaranteed.

Our criticism is constructive, and we have proposed two ways to improve
the evaluation criterion, which is a topic of primary importance for any
competition.